In this recipe we look at how to find the longest stretch of a specified set of characters in a BiologicalSequence object. In this particular case we'll work with NucleotideSequences, and look for stretchs of purines and pyrimidines. In this context, I use the term stretch to mean a series of contiguous characters of interest (e.g., purines). We'll obtain the length and the start position of the longest stretch of various character sets.

First, we'll configure the environment. We'll be using skbio and itertools for this.


In [1]:
from __future__ import print_function
import itertools

from skbio import parse_fasta, NucleotideSequence

Next we'll define our character sets of interest. We'll define the set of purines and pyrimidines here.


In [2]:
purines = set('AG')
pyrimidines = set('CTU')

I'm going to obtain a single sequence from a fasta file. In this case, this is a short 16S sequence. In practice, you might be loading a whole genome here, but the process will be the same.


In [3]:
id_, seq = list(parse_fasta(open('data/single_sequence1.fasta')))[0]
n = NucleotideSequence(seq, id=id_)
n


Out[3]:
<NucleotideSequence: GAGTTTGATC... (length: 1541)>

Here's where it gets interesting - we'll define a longest_stretch function that takes a BiologicalSequence object and the characters of interest, and returns the length of the longest contiguous stretch of the characters of interest, as well as the start position of that stretch of characters. (And of course you could compute the end position of that stretch by summing those two values, if you were interested in getting the span.)


In [4]:
def longest_stretch(sequence, characters_of_interest):
    # initialize some values
    current_stretch_length = 0
    max_stretch_length = 0
    current_stretch_start_position = 0
    max_stretch_start_position = -1
    
    # this recipe was developed while reviewing this SO answer:
    # http://stackoverflow.com/a/1066838/3424666
    for is_stretch_of_interest, group in itertools.groupby(sequence, 
                                                           key=lambda x: x in characters_of_interest):
        current_stretch_length = len(list(group))
        current_stretch_start_position += current_stretch_length
        if is_stretch_of_interest:
            if current_stretch_length > max_stretch_length:
                max_stretch_length = current_stretch_length
                max_stretch_start_position = current_stretch_start_position
    return max_stretch_length, max_stretch_start_position

We can apply this to find the longest stretch of purines...


In [5]:
longest_stretch(n, purines)


Out[5]:
(11, 141)

We can apply this to find the longest stretch of pyrimidines...


In [6]:
longest_stretch(n, pyrimidines)


Out[6]:
(6, 181)

Or the longest stretch of some other character or characters.


In [7]:
longest_stretch(n, set('N'))


Out[7]:
(1, 573)

In this case, we try to find a stretch of a character that doesn't exist in the sequence.


In [8]:
longest_stretch(n, set('X'))


Out[8]:
(0, -1)

NOTE: In the future we'll be adding direct support for this functionality to the BiologicalSequence object. You can track progress on that here.